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ABSTRACT 

Graphite web is a novel web tool for pathway 
analyses and network visualization for gene expres- 
sion data of both microarray and RNA-seq experi- 
ments. Several pathway analyses have been 
proposed either in the univariate or in the global 
and multivariate context to tackle the complexity 
and the interpretation of expression results. These 
methods can be further divided into 'topological' 
and 'non-topological' methods according to their 
ability to gain power from pathway topology. 
Biological pathways are, in fact, not only gene lists 
but can be represented through a network where 
genes and connections are, respectively, nodes 
and edges. To this day, the most used approaches 
are non-topological and univariate although they 
miss the relationship among genes. On the 
contrary, topological and multivariate approaches 
are more powerful, but difficult to be used by re- 
searchers without bioinformatic skills. Here we 
present Graphite web, the first public web server 
for pathway analysis on gene expression data that 
combines topological and multivariate pathway 
analyses with an efficient system of interactive 
network visualizations for easy results interpret- 
ation. Specifically, Graphite web implements five 
different gene set analyses on three model organ- 
isms and two pathway databases. Graphite Web is 
freely available at http://graphiteweb.bio.unipd.it/. 

INTRODUCTION 

The complexity of the regulatory mechanisms of the cell 
combined with the difficulties in the interpretation of high- 
throughput 'omic' data has led to the development of a 
myriad of novel computational methods for data manage- 
ment, analysis and integration. 

Two approaches can be recognized: (i) those based on 
unsupervised approaches where gene expression data 
is combined with protein-protein and protein-DNA 



interactions networks to the identification of modules or 
subnetworks of the entire network (1-4) and (ii) those 
based on supervised approaches where predefined gene 
sets are tested to evaluate their involvement in a specific 
experimental condition (5-15). The focus of this work is 
on this second type of approaches. 

The aim of the supervised methods is to identify sets of 
genes with coordinated expression and/or concentration 
changes in different biological conditions, unravelling 
the complexity of cellular regulatory processes. In this 
context, the use of pathways as gene sets is of great help 
in simplifying the complexity and the interpretation of 
gene expression measurements. 

Gene set analyses can be subdivided into the classical 
enrichment analysis, working on gene lists selected 
through a gene-level test, and the novel global and multi- 
variate approaches that define a model for the whole 
gene set. 

In general, these two approaches have two fundamen- 
tally different null hypotheses. The first type hypothesizes 
that a given gene set has the same level of association with 
a phenotype as the rest of the genes. The second type only 
considers the genes within a gene set and hypothesizes that 
there is no gene in the gene set associated with the pheno- 
type (6). These two approaches have been termed 'com- 
petitive' and 'self-contained', respectively (16). It is worth 
noting that multivariate approaches can be competitive 
or self-contained, and conversely, competitive can be 
multivariate. 

The main drawbacks with 'competitive' methods are 

(i) the assumption that genes are independent along with 

(ii) the use of a cut-off threshold for the selection of dif- 
ferentially expressed genes (DEGs). In this way, many 
genes with moderate but meaningful expression changes 
are discarded by the strict cut-off value, which leads to a 
reduction in statistical power. On the other hand, global 
and multivariate approaches relax the assumption of in- 
dependence among genes of the same gene set and identify 
possibly moderate, but coordinated, expression/concen- 
tration changes that cannot be detected by the previous 
approaches without depending on any arbitrary cut-offs. 

A biological pathway is not a mere list of genes 
but represents the biologic relations between the 
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macromolecules within a cell. They can be represented 
through graphs where genes and their relations are, re- 
spectively, nodes and edges. As a result, pathway 
analyses can be further divided into 'topological' and 
'non-topological' depending on their ability to gain 
power from the information stored in the graph. Nearly 
all gene set analyses consider pathways as a simple gene 
list, ignoring the topological information. The reason for 
this is 2-fold: (i) the difficulty of retrieving the information 
of pathway topology and converting it to a gene network, 
(ii) the difficulty of including graph topology within stat- 
istical models. 

Pathway annotation comprises chemical compounds 
mediating interactions and different types of gene groups 
(e.g. protein complexes or gene families) that are usually 
represented as single nodes but whose measures are not 
available using microarray or RNA-seq data. It is there- 
fore necessary to convert pathways from their native 
format to gene-only networks. Our group has recently de- 
veloped 'graphite' (17), a Bioconductor package that 
taking the information from different databases, interprets 
pathway formats and reconstruct the correspondent gene- 
gene networks following specific biologically driven rules, 
'graphite' (17) gives the unprecedented possibility to use 
pathway topology for gene expression data analysis. To 
address the issue of considering graph topologies within 
statistical models, we recently proposed a totally new 
method for topological pathway analysis, called 
CliPPER (10,18). CliPPER is a two-step empirical 
approach based on Gaussian graphical models, which 
identifies pathways with means or covariance matrices sig- 
nificantly different between experimental conditions. It 
also selects the portions of the pathway, called signal 
paths, which are associated the most with the given 
phenotype. 

To this day, the most used approaches remain based 
on non-topological and univariate methods. These 
approaches completely miss the relationship among 
genes (6-9 among others) but offer an intuitive 
result interpretation. Topological and multivariate 
approaches are on the contrary more powerful, but are 
also difficult to use by researchers without bioinformatic 
skills. 

Here we present Graphite web, the first public server for 
topological-based pathway analysis based on high- 
throughput gene expression data analyses. Graphite web 
combines topological and multivariate pathway analyses 
with an efficient and interactive system of network visual- 
izations that allows an easy results interpretation. 
Specifically, Graphite web deals with microarray or 
RNA-seq data. It implements different multivariate gene 
set analyses [classical hypergeometric enrichment, global 
test (7), gene set enrichment analysis (GSEA) (6), 
signalling pathway impact analysis (SPIA) (5), CliPPER 
(10) on three model organisms (human, mouse and dros- 
ophila] and two pathway databases [KEGG (19) and 
Reactome (20)]. The implementation of different types 
of analysis will open up to the user the significant possi- 
bility to directly benchmark the performances of different 
algorithms on her data. 



WEB TOOL IMPLEMENTATION 

Graphite web has two sections: (i) mapping and inter- 
active browsing of pathway networks and (ii) pathway 
analysis using gene expression data from either microarray 
or sequencing technology. 

Before giving the details of these two separate sections, 
we briefly introduce the way 'graphite' converts pathway 
topology into gene-only networks. 

Pathway topology conversion, visualization and web 
implementation 

Pathway annotations comprise a myriad of interactions, 
reactions and regulations, which are often too rich to be 
represented in a network. Challenges are posed in particu- 
lar by the presence of chemical compounds mediating 
interactions and by different types of gene groups (e.g. 
protein complexes or gene families) that are usually rep- 
resented as single nodes. 

The core of Graphite web is 'graphite' (17), a 
Bioconductor tool recently developed by our group for 
the storage, interpretation and conversion of pathway 
topology to gene-only networks, 'graphite' discriminates 
between different types of biological gene groups and 
propagates gene connections through chemical com- 
pounds. Specifically, protein complexes are expanded 
into a clique (all proteins connected to the others), while 
the gene families are expanded without connections 
among them; see (17) for more details. 

Chemical compounds are not usually measured with 
high-throughput technologies; however, pathway annota- 
tions contain several compound-mediated interactions 
(interactions for which a compound acts as a bridge 
between two elements). As the trivial elimination of the 
compounds strongly bias the topology, 'graphite' takes 
into account cell compartment membership and propa- 
gates the signal connecting the compound-mediated 
elements [see (17) for more details]. 

Graphite web uses the gene-only networks derived from 
this conversion for the topological analyses and for the result 
visualization. 

Tissue specificity is a critical point to better comprehend 
and interpret the final results; an imprecise model affects 
the efficacy of the analyses. Unfortunately, current 
pathway databases represent pathways regardless of the 
cell type and tissue they occur in. Thus, the user has to 
be aware that the topology provided by Graphite web 
represents the integration of the information available in 
different experimental conditions. 

Apart from 'graphite', Graphite web uses other 
Bioconductor packages for pathway analysis (SPIA, 
sigPathway, globaltest, goseq, clipper), identification of 
DEGs (edgeR, limma), imputation of missing values 
(impute) among others. Graphite web will be automatic- 
ally updated every 6 months whenever the new 
Bioconductor release will be published. Cytoscape web 
(21) is used to provide an interactive view of the networks. 

Pathways browsing 

The Browse section allows the user to visualize genes 
(nodes) on pathways (networks), using a colour scale 
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proportional to the fold changes of the genes (if they are 
provided by the user). Gene IDs in the input (Gene IDs 
supported EntrezGene, Ensemble gene ID, HUGO 
Symbol), are automatically converted to EntrezGenes 
and mapped on all the pathways of the selected 
database. The following steps are required: 

Step 1: Select the organism. 
Step 2: Select the database. 

Step 3: Upload the input file or paste the input in a text 
box. 

Input files can be tab-delimited with two columns: the 
first column is the gene ID and the second column is 
the log fold change (optional) associated to the gene. In 
case only the expression directions are available (over/ 
under expression), the user can associate to the genes the 
values —1 and +1. 

The results are divided in sections reporting, respect- 
ively, (i) the table of all the pathways with at least one 
mapped gene, and for each pathway (ii) the interactive 
network-based visualization with nodes coloured accord- 
ing to the fold change provided. 

Analysis 

A brief overview of the gene set approaches implemented 
in Graphite web is reported below. According to the 
statistic used, each method is categorized as competitive 
or self-contained and topological or non-topological. 
For an extensive review and critical discussion see 
(16,22-27). 

Enrichment analysis ( competitive and non-topological ) 

Enrichment analysis is based on Fisher Exact test and es- 
timates the chance probability of observing a given 
number of genes of a particular pathway among the 
selected DEGs. For each pathway, a two-way contingency 
table is generated as follows: 
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where EEG means equally expressed genes, N is the total 
number of genes screened, G is the pathway and GC is the 
complement of G. N, and n ; are the frequencies of genes 
belonging to each table cell. Then, the probability P of 
observing at least n G ,d eg genes of a functional category 
within a group of N deg genes is given by: 
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Then, Ps are adjusted using Benjamini and Hochberg 
method (28). 

In case of RNA-seq count data, the statistical test 
widely used to identify DEGs is based on the negative 



binomial distribution (accounting for a quadratic depend- 
ence between mean and variance) (29). In this case, the 
read count defines the power of the test and given the strict 
dependence between read count and gene length, longer 
genes are characterized by a higher statistical power to be 
detected as differentially expressed. It has been shown that 
these differences in length/power if not properly assessed 
can introduce some bias in the final results (30). Graphite 
web allows an optional accounting for this bias using the 
P-value correction for the gene length as implemented in 
'goseq' Bioconductor package. 

Global test for group of genes ( self-contained and 
non-topological ) 

Global test is based on a penalized logistic regression 
model. The general idea is to find the genes within a 
pathway whose combination of expression profiles best 
predicts clinical data (subdivision in samples classes). In 
this model, the dependent variable is the vector of classes, 
while the covariates are the expression profiles of the genes 
belonging to the pathway. The model has a total number 
of parameters equal to the number of genes in the 
pathways. Typically, using gene expression data from 
high-throughput technology, the number of samples is 
much lower than the number of genes within a pathway. 
This type of model is defined non-identifiable, as we do 
not have a sufficient number of replicate (samples) to 
estimate the parameters. To cope with this unbalance 
structure of the data, Goeman et al. (7) proposed a 
penalized regression model where the coefficient of some 
genes are shrinked toward zero, reducing the number of 
parameters to be estimated. 

Gene set enrichment analysis ( self-contained or 
competitive and non-topological) 

GSEA was originally proposed by (8,9). The procedure is 
based on the following steps: (i) select a statistic to 
compare groups of samples (e.g. ?-test), (ii) rank the 
entire list of genes according to the value of this statistic, 
(iii) define a pathway G, and compare the distribution of 
the statistic of G and GC. 

In the original version GSEA, a weighted Kolmogorov- 
Smirnov (K-S) test was proposed for the comparison 
between G and GC test distributions, where the gene 
weights were given by the absolute value of the statistic. 
The significance of weighted K-S test was estimated 
through a permutational approach. The authors suggest 
permuting samples if the sample size is sufficiently large 
and to permute the genes otherwise. A normalization 
strategy was also proposed for the K-S statistic to take 
into account the pathway dimension. In these years, 
several improvements have been described using alternative 
ranking metrics, enrichment statistics and several variations 
of the significance estimation schemes; see (6,31-33) among 
others. Specifically, it was shown (6) that the differences in 
the correlation structure of each pathway could lead to a 
biased comparison among gene sets unless a normalization 
procedure is applied. To cope with this problem, Tian et al. 
(2005) proposed the use of the standard statistical approach 
for comparing mean shift of the G and the GC distribu- 
tions: a one-sample z-test with a permutational approach. 
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In particular, they described two different approaches: per- 
mutation of samples and permutation of genes. The first 
one leads to a self-contained test, the second one to a com- 
petitive test. 

Given t h the statistic of the gene i with i = 1 . . . N where 
N is the total number of genes, the two gene set statistics 
proposed by Tian et al. (2005) are as follows: 



with its null distributions generated by permuting {tj,..., 
with its null distributions generated by permuting samples 

{Z Z p }. 

It is important to note that although the formula for E k 
is the same as that of T k , their probability interpretations 
and hence their testing procedures are different. In T k , t t is 
deterministic and the gene set structure is random; in E k , 
the opposite is true (6). 

After a proper standardization, we obtain NT k and NE k 
statistics. The correlation structure in gene sets can still 
give false positives for NT k ; conversely, the gene set size 
can influence NE k . Then Tian et al. (2005) thus suggests 
taking as good candidate's pathways that are significant 
for both NT k and NE k . 

Graphite web implements Tian et al. (2005) GSEA 
statistics. 

Signalling pathway impact analysis ( competitive and 
topological ) 

The method proposed by Tarca et al. (2009) (5,34) calcu- 
lates a score through the combination of several aspects of 
the data: the fold change of the DEGs, the pathway en- 
richment score and the topology of signalling pathways. 
Specifically, from a topological point of view, SPIA 
enhances the impact of a pathway if the DEGs tend to 
lie near the entry points of a pathway (gene upstream of 
the pathway). 

SPIA needs as input the list of DEGs with their log fold 
changes and the complete list of gene names in the 
platform. SPIA then computes (i) the hypergeometric en- 
richment P-values, pNDE, (ii) a perturbation factor as a 
linear function of the perturbation factors of all genes in 
a given pathway, whose significance is calculated through a 
bootstrap approach, pPERT, and (iii) the combination of 
the two independent P- values (pNDE and pPERT), called 
pG. pGs are then adjusted for multiple testing using the 
false discovery rate (FDR) algorithm (35). 

Each pathway is finally marked as activated (positive 
perturbation score = positively perturbed) or the inhibited 
(or negatively perturbed) (5). 

Pathway analysis through Gaussian Graphical Models 
( CUPPER ) ( self-contained and topological ) 

Pathway dimensions are highly heterogeneous and we 
expect, from a biological point of view, that only some 



portions of a pathway would be involved, especially for 
large pathways. Among topological methods, however, 
none tries to identify the signal paths that are involved 
the most in the biological problem. 

In this perspective, our group has developed CliPPER 
(10,18), a totally new approach to fill this gap. Specifically, 
CliPPER is an empirical method based on Gaussian 
graphical models that (i) selects pathways with covariance 
matrices or means significantly different between experi- 
mental conditions; and (ii) on such pathways, identifies the 
portions of a pathway, called signal paths, associated the 
most with the phenotype. 

Different experimental conditions are usually compared 
in terms of their gene expression mean differences. 
However, the difference in mean expression levels does 
not necessarily result in a change of the interaction 
strength among genes. For example, a proportional 
increase of the expression of the genes A and B in one 
of two conditions will result in significantly different 
mean expression between the two conditions. The correl- 
ation strength between A and B, however, does not 
change. In this case, we would have pathways with signifi- 
cant altered mean expression levels but unaltered biolo- 
gical interactions. 

If, on the contrary, transcripts abundances ratios are 
altered, we expect a significant alteration not only of 
their mean expression levels, but also of the strength of 
their connections. That corresponds to a change in the 
biological activity that can be captured by the measuring 
the expression variance. 

CliPPER therefore searches for pathways strongly 
involved in a biological process by requesting that the 
mean or the variance of the expression levels result signifi- 
cantly altered between two conditions. 

Input files and data processing 

After selecting the type of analysis, the organism, the 
database and the type of data (microarray or RNA-seq), 
the user has (i) to choose the threshold for the minimum 
number of mapped gene within a pathway required for the 
pathway to be processed, (ii) to optionally insert the email 
for results notification and finally (iii) to upload input files. 
A scheme of the different data processes required by the 
analysis is shown in Figure 1. 

Graphite Web takes as input tab-delimited files. 
Depending on the analysis selected and the user setting, 
these files can be matrices (samples on the columns and 
genes on the rows) or lists of DEGs. 

Graphite web supports different types of IDs 
(EntrezGene, Ensemble gene ID and gene Symbol). 
Before the analysis begins, gene IDs are all converted, if 
necessary, to EntrezGene. 

Enrichment analysis and SPIA require two separate files 
(i) the list of DEGs with fold change and (ii) the whole list 
of genes. Graphite web gives two possibilities: (i) the 
separate upload of these files, when the user has already 
performed differential expression, or (ii) the upload of 
the normalized expression matrix with the automatic de- 
tection of DEGs using an empirical Bayes test [(28), for 
microarray data as implemented in 'limma' Bioconductor 



Nucleic Acids Research, 2013, Vol. 41, Web Server issue W93 



c 


Analysis Section 






Select Species 
Select Database 
Select Pathway Analysis 


/■ * -x f 


i 



Enrichment 
SPIA 



GSEA 
CliPPER 



With DECs 



Without DEGs 



Upload DEGs list with FC 

Upload all gene IDs 

■ . J v_ 



Upload expression 
matrix 



Upload expression 
matrix 



NA imputation 



NA imputation 



Square root 
transformation 



Global Test 



Upload expression 
matrix 



NA imputation 



ID conversion to EntrezGene 



J 



Giraphit-e Web 

pathway I analysis for gene expression cfqfcq 




Cvdln D asoCKefl tvtna in Gl 
Gl Phase 

EB Qualtty Control Compartment (ERQC) 



iti-apootpfc BCl-Z members 



Aiptia-oxicat'on Q< anrtanaie 

■n d tPCJC. Ktivatnr; hH»n fil IS and «rfv «i 





Figure 1. Schematic representation of Graphite web data processing according to the different analyses provided. 



package] or negative binomial test [(29), for RNA-seq 
data as implemented in 'edgeR' Bioconductor package]. 

Global test, GSEA and CliPPER require a normalized 
expression matrix. 

Expression matrices should be tab-delimited text files 
where the first row should contain sample names (the 
sample name represents the sample class) and the first 
column the gene IDs. In case of missing values (repre- 
sented by the 'NA' string) Graphite web automatically 
performs an imputation using k-nearest neighbor algo- 
rithm as implemented in 'impute' Bioconductor package. 



When the input data derives from an RNA-seq experi- 
ment, GSEA and CliPPER analyses are run over the 
square root transformed count data. It has been shown, 
indeed, that the square root of a Poisson-distributed 
variable is approximately normally distributed (36). 
Global test does not require any data transformations. 

Output and network visualization 

Graphite web removes from the analysis all the pathways 
with a number of mapped genes (intersection of genes in 
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the expression matrix and genes in the pathway) less 
than the user-defined threshold and generates a table re- 
porting the names of at least 50 top-scoring pathways. For 
each pathway, an interactive graph represents the gene- 
only network. Genes are colour-coded according to their 
contribution to the analysis. Grey nodes are genes not 
available in the platform/matrix uploaded; they are 
nodes that did not contribute to the analysis. Coloured 
nodes (with colour proportional to the log fold change) 
correspond to genes differentially expressed (for 
hypergeometric and SPIA analysis) or that strongly con- 
tribute to the significance of the pathway (for globaltest, 
GSEA and clipper). White nodes are those genes not dif- 
ferentially expressed (for hypergeometric and SPIA) or 
that contribute little to the significance of the pathway 
(for globaltest, GSEA and clipper). 

All the results can be downloaded (as a single zip file) 
and every analysis is stored at specific URL that can be 
accessed for 6 months from different IPs. 

For each significant pathway, Graphite web allows the 
download of (i) the PDF image of the network with nodes 
colour-coded according to their contribution to the 
analysis; (ii) a text file with the list of genes 
(EntrezGene, Symbol, Description and score) belonging 
to the pathway and used for the analysis; (iii) a text file 
with the list of genes (EntrezGene, Symbol, Description) 
belonging to the pathway, but not used for the analysis 
(not present in the list of genes provided by the user) (iv) a 
SIF file to load the pathway network in an external 
software such as Cytoscape (37). 

CASE STUDIES 

The cancer genome atlas colorectal cancer data 

As a benchmark case study, we used the gene expression 
data available from the TCGA project on colorectal 
cancer (CRC). Clinical information and normalized ex- 
pression profiles on 220 individuals were downloaded 
from https://tcga-data.nci.nih.gov/docs/publications/ 
coadread_2012/. 

CRC ranks the third and second among all commonly 
encountered malignancies in terms of incidence and mor- 
tality, respectively. The high mortality rate of advanced 
CRC can be attributed to limited treatment options. In 
this perspective, the stage of a cancer is one of the most 
important factors in determining prognosis and treatment 
options. Stage codification is based on how far the cancer 
has grown into the wall of the intestine, whether it has 
reached nearby structures and whether it has spread to 
the lymph nodes or distant organs. It is usually quoted 
as I, II, III, IV, where a higher number indicates a more 
advanced cancer and likely a worse outcome. CRC stage I 
indicates that cancer has begun to spread, but is still in the 
inner lining; stage II indicates that cancer has spread to 
other organs near the colon or rectum but it has not 
reached lymph nodes; stage III indicates that cancer has 
spread to lymph nodes, but has not been carried to distant 
parts of the body, while stage IV indicates metastasis. 
Given the importance of the comprehension of the mech- 
anisms that lead to the spread of the cancer on distant 



organs, in this example we focus on the transition 
between stage II and III. We select stage II and stage III 
from the whole cohort of patients obtaining a list of 137 
individuals (82 of stage II and 55 of stage III). We then 
performed all the analyses provided by Graphite web. 

It is worth noting that no DEGs have been identified 
using empirical Bayes test (38) (FDR < 0.1). Our example 
demonstrates that in such situations, high-level pathway 
analyses are a valuable alternative to detect moderate but 
coordinate expression/concentration alterations. Given 
the absence of DEGs we proceed only with global test, 
GSEA and CliPPER. 

Global test does not return significant pathways 
(FDR < 0.05). 

GSEA identified (44) significant Reactome pathways 
with NTk statistic (gene permutation strategy) and no 
pathways with NEIc statistic (sample permutation). In 
the home page of Graphite web, the complete lists of the 
significant pathways with the adjusted _P-value for KEGG 
and Reactome databases are available. 

Mechanism of CCR has not been fully characterized 
yet; however, it has been known that deregulation of cell 
cycle and apoptosis contribute to cancer progression and 
both this pathways are present in the analysis reported 
(39,40). In addition, almost all the top significant 
pathways are immune related. It has been shown that 
the balancing between the activation and the suppression 
of the host immune system against CCR play a key role 
determining the cancer progression (41). In particular, we 
find pathways involving NFKB and JAK-STAT sig- 
nalling, Toll-like receptors (TLRs) signalling, especially 
the MyD88-dependent cascades, and interleukins 
signalling pathways. It has been demonstrated that 
TLRs signalling directly promote and support intestinal 
carcinogenesis, and in fact a reduced expression of TLR4 
is associated with tumour growth and metastatic status 
(42). Finally the involvement of MyD88-dependent 
TLRs signalling in tumour growth and progression has 
been demonstrated both in mice model (43) and in CRC 
cell lines (44). On the other hand, inhibition of JAK1, 2/ 
STAT3 signalling induces apoptosis, cell cycle arrest and 
reduces tumour cell invasion in CRC cells (45). IL-6 is a 
multi-functional pro-inflammatory cytokine that has 
crucial roles in tumour progression through growth-pro- 
motion, anti-apoptotic activity and modulation of 
immune function, and thus is a strong candidate for 
mediating both local and systemic cancer-associated in- 
flammatory responses. It is of interest, therefore, that 
the IL-6/JAK/STAT pathway has emerged as a key 
player in cancer-associated inflammation (46). 

Exploiting the possibility offered by Graphite web to 
investigate the most influential genes within each 
pathway, we uncover a series of genes that perfectly cor- 
roborate previous observations. In particular, we find 
NFKB2, NFKBIA, JAK2, IL6, FOS, TLR4, PIK3CB, 
TNF, STAT3, STAT5, S100A12, IRAK3. 

Although CliPPER is based on a different null hypoth- 
esis with respect to GSEA, their results partly overlap (see 
the home page of Graphite web for the complete list of 
results). The statistical significance of this overlap for 
Reactome and KEGG results has been estimated using 
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Figure 2. A selection of CliPPER results on CRC data set. Three significant pathways are shown. Coloured nodes represent the portions of the 
pathways mostly involved in the progression from stage II to stage III identified by CliPPER. The colour of the nodes is proportional to their fold 
change (stage III versus stage II). 



hypergeometric distribution (P = 0.0003 for KEGG and 
P = 0.001 for Reactome). CliPPER identifies 80 
pathways (Reactome with means or concentration 
matrices significant altered in the two classes). Many of 
them are associated to TLRs (specifically TLR4), apop- 
tosis, cell cycles, but also NOTCH, Wnt and Hypoxia- 
inducible Factor signalling pathways. They are all 
known to have key roles in CRC. Figure 2 shows three 



of the most interesting pathways identified by CliPPER, 
where coloured nodes represent the portion of the 
pathway involved the most in the pathology, and the 
colours themselves are proportional to the gene fold 
change (stage III versus stage II). Notch signalling is an 
important molecular pathway involved in the determin- 
ation of cell fate. In recent years, this signalling has been 
frequently reported to play a critical role in maintaining 
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progenitor/stem cell population as well as a balance 
between cell proliferation, differentiation and apoptosis 
(47). Notch signalling is often and aberrantly activated 
by hypoxia during tumour progression. Specifically, the 
activation of Jagged2 by hypoxia in tumour cells 
induced epithelial to mesenchymal transition and it also 
promoted cell survival in vitro, playing a critical role in 
tumour progression and metastasis (Figure 2) (48). Many 
of the adaptations to hypoxia are mediated by the activa- 
tion of specific genes through hypoxia-inducible factor 
(HIF) such as HIF1 and HIF2 (also known as EPAS1). 
Their key regulatory subunits, HIF- la and HIF-2a, are 
induced similarly by hypoxia, but their functional roles in 
cancer may be distinct and isoform specific. Xenograft 
studies revealed that HIF- lot deficiency inhibited overall 
tumour growth, whereas deficiency of HIF-2a stimulated 
tumour growth (Figure 2) (49). 

As a practical example of CUPPER demonstrating its 
usefulness during the interpretation of results, we investi- 
gate in detail the TLR4 pathway that is known to be 
involved in CRC. 

CliPPER highlights two TLR4-mediated signalling 
pathways that correspond to two-coloured path of the 
network. The first one is the MAPK area with 
(MAPK1, MAPK3, MAPK7, some subunits of protein 
phosphatase 2 and some dual specificity phosphatase, 
DUSP). The ERK/MAPK pathway is one of the most 
important for cell proliferation, and its overexpression 
and activation are commonly detected in CRC. Several 
evidence indicates that overexpression and activation of 
ERK MAPK play an important part in the progression 
of this cancer (50). The second one is the TRL4 area (with 
MYOD88, TRL3, IRAKs and TRAFs genes) that, as 
reported before, is perfectly coherent with CRC 
progression. 

While a complete investigation of biological relevance 
of all the results reported by Graphite web is beyond the 
scope of this work, these results highlight the usefulness 
and power of the tool even in cases where biological 
groups are highly similar and classical inferential 
approaches fail to provide new insights. 



CONCLUSIONS 

Pathway analysis aims at identifying groups of function- 
ally related genes that show coordinated expression/con- 
centration changes. Recently, pathway analyses moved 
from algorithms using mere gene lists to new ones exploit- 
ing the topology that define gene connections. 
Unfortunate limits to the use of these new methods are 
(i) the availability of the pathways as gene networks in 
which nodes are only genes, (ii) a user-friendly access to 
topological statistical analysis usually implemented in the 
R language. Graphite web has been developed to face 
both issues. The core of Graphite web is graphite (17), a 
tool developed by our group for the storage, interpret- 
ation and conversion of pathway topology to gene-only 
networks using biological-driven rules. Graphite web im- 
plements a totally new system of pathway visualization 
and provides an easy access to multivariate and 



topological pathway analyses. The combination of a 
pathway-specific visualization with powerful gene set 
analyses gives to the user the possibility to explore in 
great detail signalling pathways and the position of the 
influential genes within them. 
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